Projection predictive variable selection for Bayesian regularized SEM
In models with many parameters, regularization is increasingly used to prevent overfitting.
Regularized SEM adds a penalty term to the fit function for the parameters to be regularized, e.g., a ridge, lasso, or elastic net penalty:
\[ F_{regsem}(S, \Sigma(\theta)) = F(S, \Sigma(\theta)) + \lambda P(\theta_{reg}) \] See for example Jacobucci, Grimm, and McArdle (2016) or Huang (2020).
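To make the penalty term \(P(\theta_{reg})\) concrete, here is a minimal base-R sketch of the three penalties mentioned above (the function names and the elastic-net mixing weight `alpha` are illustrative, not from any package):

```r
# Common penalty functions P(theta_reg) used in regularized SEM
ridge <- function(theta) sum(theta^2)     # quadratic shrinkage
lasso <- function(theta) sum(abs(theta))  # absolute-value shrinkage
# Elastic net: a convex combination of lasso and ridge penalties
enet <- function(theta, alpha = 0.5) alpha * lasso(theta) + (1 - alpha) * ridge(theta)

theta_reg <- c(0.5, -0.2, 0)
ridge(theta_reg)  # 0.29
lasso(theta_reg)  # 0.7
```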
Bayesian regularized SEM is increasingly used (see for example Zhang and Liang 2023).
A shrinkage prior takes the role of the penalty:
\[ p(\theta|y) \propto p(y|\theta)p(\theta) \]
Many different shrinkage priors exist (see e.g., Van Erp, Oberski, and Mulder (2019)).
Ideal shrinkage prior:
1. Peaked around zero
2. Heavy tails
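As a base-R illustration of these two properties, compare a standard normal (ridge-type) prior with a Laplace (lasso-type) prior: the Laplace density is both more peaked at zero and heavier in the tails.

```r
# Laplace (double-exponential) density; not in base R, so defined by hand
dlaplace <- function(x, scale = 1) exp(-abs(x) / scale) / (2 * scale)

# 1. More peaked around zero than the normal
dnorm(0)     # 0.3989...
dlaplace(0)  # 0.5

# 2. Heavier tails than the normal
dnorm(4)     # 0.000134...
dlaplace(4)  # 0.00916...

stopifnot(dlaplace(0) > dnorm(0), dlaplace(4) > dnorm(4))
```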
In Bayesian regularized SEM, parameters are not automatically set to zero.
See also Piironen et al. (2018)
Goal of projpredSEM: to provide a formal method of parameter selection, i.e., model selection.
Goal: finding a smaller submodel that predicts practically as well as the larger reference model.
Consider a simple linear regression model: \(y = X\beta + \epsilon\) with 20 predictors.
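A small base-R sketch of this setting, simulating 20 predictors of which only 2 are truly relevant (sample size and coefficient values are chosen for illustration only):

```r
set.seed(1)
n <- 50; p <- 20
X <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, rep(0, p - 2))  # only 2 of the 20 predictors are relevant
y <- as.vector(X %*% beta + rnorm(n))

# The full reference model still estimates all 20 coefficients,
# including the 18 that are truly zero:
fit <- lm(y ~ X)
length(coef(fit)) - 1  # 20
```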
Importantly: projpred (Piironen et al. 2023) is developed for selecting a minimal subset of predictors.
projpred works automatically with brms or rstanarm:
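A minimal sketch of that workflow, assuming rstanarm and projpred are installed (the data and variable names are illustrative; the model fit is skipped when the packages are unavailable):

```r
set.seed(1)
dat <- data.frame(y = rnorm(30), x1 = rnorm(30), x2 = rnorm(30), x3 = rnorm(30))
f <- y ~ x1 + x2 + x3  # reference-model formula

if (requireNamespace("rstanarm", quietly = TRUE) &&
    requireNamespace("projpred", quietly = TRUE)) {
  refm <- rstanarm::stan_glm(f, data = dat, chains = 1, iter = 500, refresh = 0)
  vs <- projpred::cv_varsel(refm)    # search + evaluation with cross-validation
  print(projpred::suggest_size(vs))  # suggested submodel size
}
```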
It is important to assess the quality of the reference model, since the predictive performance of the submodel can only be as good as that of the reference model.
When diagnosing a reference model, there are three primary dimensions we recommend the statistician investigate:
1. posterior sensitivity to the prior and likelihood;
2. posterior predictive checks;
3. cross-validation and the influence of data on the posterior.
(McLatchie et al. 2023)
Search: determine the solution path
Evaluation: determine the final submodel size
Cross-validation is used for the evaluation part and recommended for the search part to protect against overfitting (Piironen and Vehtari 2017).
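The two steps can be mimicked in base R with a greedy forward search and held-out evaluation. This is a simplified analogue of what projpred does, using squared-error loss and a single train/test split instead of cross-validation:

```r
set.seed(2)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p); colnames(X) <- paste0("x", 1:p)
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)  # only x1 and x2 are relevant
train <- 1:70; test <- 71:100

# Search: greedy forward selection on the training data (solution path)
path <- integer(0); remaining <- 1:p
for (k in 1:p) {
  rss <- sapply(remaining, function(j) {
    d <- data.frame(y = y[train], X[train, c(path, j), drop = FALSE])
    sum(resid(lm(y ~ ., data = d))^2)
  })
  best <- remaining[which.min(rss)]
  path <- c(path, best); remaining <- setdiff(remaining, best)
}

# Evaluation: held-out prediction error for each submodel size along the path
err <- sapply(1:p, function(k) {
  d <- data.frame(y = y[train], X[train, path[1:k], drop = FALSE])
  m <- lm(y ~ ., data = d)
  newd <- data.frame(X[test, path[1:k], drop = FALSE])
  mean((y[test] - predict(m, newdata = newd))^2)
})
```

The final submodel size is then the smallest `k` whose held-out error is practically as good as that of the full path.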
Based on a MIMIC model with \(k = 1, \ldots, K\) latent variables \(\pmb{\eta}\), \(p = 1, \ldots, P\) predictor variables \(\pmb{x}\), \(j = 1, \ldots, J\) indicator variables \(\pmb{y}\) for \(i = 1, \ldots, N\) measurements: \[ \begin{aligned} \pmb{\eta} & = \pmb{\Gamma} \pmb{x} + \pmb{\delta} \\ \pmb{y} & = \pmb{\Lambda} \pmb{\eta} + \pmb{\epsilon} \end{aligned} \]
with \(\pmb{x} \sim \pmb{N}_P (\pmb{0}, \pmb{\Phi})\), \(\pmb{\delta} \sim \pmb{N}_K(\pmb{0}, \Psi)\), and \(\pmb{\epsilon} \sim \pmb{N}_J(\pmb{0}, \pmb{\Theta})\).
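In lavaan syntax, a MIMIC model of this form could be specified as follows (a sketch with illustrative variable names; one latent variable, three indicators, two predictors):

```r
# MIMIC model specification (lavaan model syntax)
mimic_mod <- '
  eta =~ y1 + y2 + y3   # measurement part: y = Lambda * eta + epsilon
  eta ~ x1 + x2         # structural part: eta = Gamma * x + delta
'
# fit <- lavaan::sem(mimic_mod, data = dat)  # requires lavaan and data
```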
Based on the model-implied covariance matrix, the predictive distribution for new responses \(\pmb{y}\) given certain values for the predictors \(\pmb{x}_0\) is multivariate normal with mean vector \(\pmb{\mu}_{y|x_0}(\hat{\pmb{\theta}})\) and covariance matrix \(\pmb{\Sigma}_{y|x_0}(\pmb{\hat{\theta}})\), which are given by (De Rooij et al. 2023):
\[ \begin{aligned} \pmb{\hat{\mu}}_{y|x_0} & = \pmb{\hat{\mu}}_y + \pmb{\hat{\Sigma}}_{xy}^T \pmb{\hat{\Sigma}}_{xx}^{-1} (\pmb{x}_0 - \pmb{\hat{\mu}}_x) \\ \pmb{\hat{\Sigma}}_{y|x_0} & = \pmb{\hat{\Sigma}}_{yy} - \pmb{\hat{\Sigma}}_{xy}^T \pmb{\hat{\Sigma}}_{xx}^{-1} \pmb{\hat{\Sigma}}_{xy} \end{aligned} \]
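These two formulas can be checked with a small base-R function (hats dropped; function and argument names are illustrative):

```r
# Conditional mean and covariance of y given x = x0 for a joint normal,
# following the formulas above; S_xy is cov(x, y), so P-by-J
cond_mvn <- function(mu_y, mu_x, S_yy, S_xy, S_xx, x0) {
  A <- t(S_xy) %*% solve(S_xx)  # Sigma_xy^T Sigma_xx^{-1}
  list(mu    = as.vector(mu_y + A %*% (x0 - mu_x)),
       Sigma = S_yy - A %*% S_xy)
}

# Tiny check: one indicator, one predictor, covariance 0.5, unit variances
out <- cond_mvn(mu_y = 0, mu_x = 0,
                S_yy = matrix(1), S_xy = matrix(0.5), S_xx = matrix(1),
                x0 = 2)
out$mu     # 1
out$Sigma  # 0.75
```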
Our goal is to minimize the Kullback-Leibler divergence between the predictive distribution based on the reference model and the predictive distribution based on the submodel, i.e.,
\[ \text{min}_{\pmb{\theta}_{sub}} \text{ KL} [\pmb{N}_J(\pmb{\mu}_{y|x_0}(\pmb{\theta}_{ref}), \pmb{\Sigma}_{y|x_0}(\pmb{\theta}_{ref})) \text{ || } \pmb{N}_J(\pmb{\mu}_{y|x_0}(\pmb{\theta}_{sub}), \pmb{\Sigma}_{y|x_0}(\pmb{\theta}_{sub})) ] \] which leads to the objective function:
\[ f = \text{Tr} (\pmb{\Sigma}_p^{-1} \pmb{\Sigma}) + (\pmb{\mu} - \pmb{\mu}_p)^T \pmb{\Sigma}_p^{-1} (\pmb{\mu} - \pmb{\mu}_p) + \text{log} |\pmb{\Sigma}_p | - \text{log} |\pmb{\Sigma}|- J \]
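A base-R implementation of this objective (the KL divergence up to a constant factor 1/2, which does not affect the minimizer; here \(\pmb{\mu}, \pmb{\Sigma}\) are the reference quantities and \(\pmb{\mu}_p, \pmb{\Sigma}_p\) the projected submodel quantities):

```r
# Log-determinant via the numerically stable determinant() decomposition
log_det <- function(M) as.numeric(determinant(M, logarithm = TRUE)$modulus)

# Objective f; equals 0 when the submodel reproduces the reference distribution
kl_obj <- function(mu, Sigma, mu_p, Sigma_p) {
  J <- length(mu)
  Sp_inv <- solve(Sigma_p)
  sum(diag(Sp_inv %*% Sigma)) +
    as.numeric(t(mu - mu_p) %*% Sp_inv %*% (mu - mu_p)) +
    log_det(Sigma_p) - log_det(Sigma) - J
}

S <- diag(2)
kl_obj(c(0, 0), S, c(0, 0), S)  # 0 (submodel matches reference exactly)
kl_obj(c(0, 0), S, c(1, 0), S)  # 1 (mean shifted by one unit)
```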
Note: lavaan, lslx and marginal CIs handle this case well too.
library(lavaan)

# Simulation settings
ntrain <- 20  # training sample size
ntest <- 15   # test sample size
p_rel <- 2    # number of predictors with a true effect
p_zero <- 8   # number of predictors with a zero effect

# Population model: one factor with five indicators (loadings .7),
# regressed on 2 relevant predictors (effects .9) and 8 irrelevant ones
sim_mod <- 'F =~ .7*y1 + .7*y2 + .7*y3 + .7*y4 + .7*y5
F ~ .9*x1 + .9*x2 + 0*x3 + 0*x4 + 0*x5 + 0*x6 + 0*x7 + 0*x8 + 0*x9 + 0*x10'

dat_train <- simulateData(sim_mod, sample.nobs = ntrain, empirical = TRUE)
dat_test <- simulateData(sim_mod, sample.nobs = ntest, empirical = TRUE)

Note: lslx does not converge with various penalties; only the elastic net selects the correct variables, but it estimates their effects at 0.005.
Sara van Erp